Data

Download Data and Import Necessary Libraries

In [64]:
import time
import warnings
import pandas as pd, numpy as np
%matplotlib inline 
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.gridspec as gridspec 
#color = sns.color_palette()
#from wordcloud import WordCloud ,STOPWORDS
#from PIL import Image
import re
from nltk.tokenize import TweetTokenizer
from nltk.stem.wordnet import WordNetLemmatizer 
color = sns.color_palette()

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
#import locale; 
#print(locale.getdefaultlocale());
from IPython.display import Image
from IPython.core.display import HTML 
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import hstack

nltk.download('wordnet')
nltk.download('stopwords')
eng_stopwords = set(stopwords.words("english"))
warnings.filterwarnings("ignore")
tokenizer=TweetTokenizer()
lem = WordNetLemmatizer()

df = pd.read_csv('/Users/yetkineser/Desktop/BDA 502/project/data/train.csv')
df = shuffle(df,random_state=7)
df_others = df.iloc[20000:,]
df = df.iloc[:20000,]
df = df.reset_index(drop=True)
df_others = df_others.reset_index(drop=True)
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yetkineser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yetkineser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Inside of Data (EDA)

In [65]:
df.head(15)
Out[65]:
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 bde12bf272afe2d2 "==Nagisa Oshima TV documentaries==\nHi, I thi... 0 0 0 0 0 0
1 8177cd93538cc152 "\nSame here, he he I did not notice. Thank y... 0 0 0 0 0 0
2 abdfcf4e47023e9b Why because there was an attack on this articl... 0 0 0 0 0 0
3 01942be64014da76 Mike Reiss==\n\nMike Reiss actually makes the ... 0 0 0 0 0 0
4 bccc084b0deb11c0 this article could do with being tidied up by ... 0 0 0 0 0 0
5 f45c474919ec4dd8 "\n\n Thomson Geer \n\nCan you please create t... 0 0 0 0 0 0
6 e9f6a52407012fbe Indeed it's a complicated matter, thank you fo... 0 0 0 0 0 0
7 014c539181889483 Suggest that the eight main propositions \n\nI... 0 0 0 0 0 0
8 aab4f48f09abd09d This looks like a personal essay of some kind.... 0 0 0 0 0 0
9 c74cce9c1e961c1e "\n\nGreat job\n\nHello Amar, Just stopped by ... 0 0 0 0 0 0
10 331b2e88e116e7aa "\nIf you (incorrectly) wish to state that Xen... 0 0 0 0 0 0
11 8d7eda30a23f3741 "\nNo, I meant if the same person comes back t... 0 0 0 0 0 0
12 a195d84134dcd0ec What the fuck are you talking about? \n\nThey ... 1 0 1 0 0 0
13 dd453f2ed98576a7 NEJM 2015 \n\nhttp://www.nejm.org/doi/full/10.... 0 0 0 0 0 0
14 e23f6b29749267f8 "\n\nMeteorite?\nThere was fragments which mad... 0 0 0 0 0 0
  • A couple of example comments
In [66]:
df['comment_text'][3]
Out[66]:
"Mike Reiss==\n\nMike Reiss actually makes the comment about the Family Guy opening scene on the commentary for Lisa's Sax. Also, it is obviously an All in the Family parody.\n\n=="
In [67]:
df['comment_text'][7]
Out[67]:
"Suggest that the eight main propositions \n\nI suggest that there are eight, and not seven main propositions in the Tractatus.\n\nThe eighth is at paragraph 5.1362 which concerns the freedom of the will:\n\n'the freedom of the will consists in the impossibility of knowing actions that still lie in the future.' 80.189.119.38  Alex Smith"
  • The mean, standard deviation and max lengths of comments
In [68]:
lengths = df.comment_text.str.len()
lengths.mean(), lengths.std(), lengths.max()
Out[68]:
(398.2133, 602.5158108035226, 5000)
  • Histogram of comment lengths
In [69]:
lengths.hist();
  • I will create a single label to predict, called "any". It will be 1 if any of the "toxic", "severe_toxic", "obscene", "threat", "insult" or "identity_hate" columns is 1.
In [70]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df['none'] = 1-df[label_cols].max(axis=1)
df['any'] = df[label_cols].max(axis=1)
df.describe()
Out[70]:
toxic severe_toxic obscene threat insult identity_hate none any
count 20000.000000 20000.00000 20000.000000 20000.00000 20000.000000 20000.000000 20000.000000 20000.000000
mean 0.096450 0.00985 0.054200 0.00275 0.049900 0.008600 0.898100 0.101900
std 0.295215 0.09876 0.226418 0.05237 0.217744 0.092339 0.302524 0.302524
min 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 1.000000 0.000000
50% 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 1.000000 0.000000
75% 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 1.000000 0.000000
max 1.000000 1.00000 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000
  • The number of comments with each label.
In [71]:
label_cols = ['any','toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for col in label_cols:
    count = df.groupby(col)['any'].count()
    new_df = pd.concat([count], axis=1)
    new_df.columns = ['count']
    display(new_df.sort_values(by=['count'],ascending=False))
count
any
0 17962
1 2038
count
toxic
0 18071
1 1929
count
severe_toxic
0 19803
1 197
count
obscene
0 18916
1 1084
count
threat
0 19945
1 55
count
insult
0 19002
1 998
count
identity_hate
0 19828
1 172
In [72]:
print("Total comments = ",len(df))
print("Total clean comments = ",len(df)-df['any'].sum())
Total comments =  20000
Total clean comments =  17962
  • Visualizing how many comments fall into each category.
In [73]:
x=df.iloc[:,2:10].sum()
#plot
plt.figure(figsize=(8,4))
ax= sns.barplot(x.index, x.values, alpha=0.8)
plt.title("# per class")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('Type ', fontsize=12)
#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()
  • Visualizing how many comments carry multiple categories at once.
In [74]:
rowsums=df.iloc[:,2:].sum(axis=1)
x=rowsums.value_counts()

#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x.index, x.values, alpha=0.8,color=color[2])
plt.title("Multiple tags per comment")
plt.ylabel('# of Occurrences', fontsize=12)
plt.xlabel('# of tags ', fontsize=12)

#adding the text labels
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()
  • Check for null values in the dataset.
In [75]:
print("Check for missing values in Train dataset")
null_check=df.isnull().sum()
print(null_check)
Check for missing values in Train dataset
id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
none             0
any              0
dtype: int64
  • There are no null values in our dataset.

  • Looking at the first five rows of our dataset

In [76]:
df.head()
Out[76]:
id comment_text toxic severe_toxic obscene threat insult identity_hate none any
0 bde12bf272afe2d2 "==Nagisa Oshima TV documentaries==\nHi, I thi... 0 0 0 0 0 0 1 0
1 8177cd93538cc152 "\nSame here, he he I did not notice. Thank y... 0 0 0 0 0 0 1 0
2 abdfcf4e47023e9b Why because there was an attack on this articl... 0 0 0 0 0 0 1 0
3 01942be64014da76 Mike Reiss==\n\nMike Reiss actually makes the ... 0 0 0 0 0 0 1 0
4 bccc084b0deb11c0 this article could do with being tidied up by ... 0 0 0 0 0 0 1 0
  • Examining the correlation between comment tags.
In [77]:
temp_df=df.iloc[:,2:-3]
# filter temp by removing clean comments
# temp_df=temp_df[~train.clean]

corr=temp_df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, annot=True)
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18cff1d0>
  • The most correlated categories are obscene and insult.

Divide Train and Test Dataset and Cross Validation

In [105]:
PATH = "/Users/yetkineser/Desktop/BDA 502/project/photos/"
Image(filename = PATH + "crossvalidation.png", width=800, height=600)
Out[105]:
  • My project initially had two steps; I decided to add two more steps later.
    • In the first step, I split my data into train and test sets and use 5-fold cross-validation on the train set. I then fit my model on the train set and predict on the test set.
    • In the second step, I replicate the target = 1 rows of the train set 10 times and add them back to the train set.
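The oversampling in the second step is applied later in the notebook; as a minimal sketch of the idea, using an invented miniature frame (column names follow the notebook, the rows are made up):

```python
import pandas as pd

# Hypothetical miniature train set, purely for illustration.
toy_train = pd.DataFrame({
    "comment_text": ["nice article", "you idiot", "thanks", "total garbage"],
    "any": [0, 1, 0, 1],
})

positives = toy_train[toy_train["any"] == 1]
# Append ten extra copies of the positive (any == 1) rows to the original frame
toy_oversampled = pd.concat([toy_train] + [positives] * 10, ignore_index=True)

print(len(toy_train), len(toy_oversampled))  # 4 -> 24
```

This trades a larger, more balanced train set for duplicated positive examples, which can raise recall at the cost of some precision.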
In [79]:
total_rows = (len(df))
train_rows = round(0.8*total_rows)
train = df.iloc[:train_rows,]
train_others = train # i added this later
test = df.iloc[train_rows-total_rows:,]
train_2=train.iloc[:,0:2]
test_2=test.iloc[:,0:2]
df_2=df.iloc[:,0:2]
df_2=df_2.reset_index(drop=True)
print("- I have ",total_rows, " rows on my data set ")
print("- I have ",train_rows, " rows on my first train set ")
print("- I have ",total_rows-train_rows, " rows on my test set ")
- I have  20000  rows on my data set 
- I have  16000  rows on my first train set 
- I have  4000  rows on my test set 

Cleaning Comments

  • Create an apostrophe lookup dictionary to expand contractions.
In [80]:
#https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view
# Aphost lookup dict
APPO = {
"aren't" : "are not",
"can't" : "can not",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"didn't": "did not",
"tryin'":"trying"
}
In [81]:
corpus=df_2.comment_text
train_text = train_2.comment_text
test_text = test_2.comment_text
  • Create a clean function to preprocess the comments:
    • lowercase everything,
    • remove leaky elements (like IP addresses),
    • remove usernames,
    • split comments into words,
    • remove numbers
In [82]:
def clean(comment):
    """
    This function receives comments and returns clean word-list
    """
    #Convert to lower case , so that Hi and hi are the same
    comment=comment.lower()
    #remove \n
    comment=re.sub("\\n"," ",comment)
    comment=re.sub("/"," ",comment) # added 
    comment=re.sub("•"," ",comment) # added
    # remove leaky elements like ip,user
    comment=re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}","",comment)
    #removing usernames
    comment=re.sub("\[\[.*\]","",comment)
    
    #Split the sentences into words
    words=tokenizer.tokenize(comment)

    
    # (') apostrophe replacement, e.g. you're --> you are
    # (basic dictionary lookup using the APPO dict above)
    words=[APPO[word] if word in APPO else word for word in words]
    words=[lem.lemmatize(word, "v") for word in words]
    words = [w for w in words if not w in eng_stopwords]
    
    clean_sent=" ".join(words)
    # remove any non-alphanumeric character
    clean_sent=re.sub("\W+"," ",clean_sent)
    clean_sent=re.sub("  "," ",clean_sent)
    clean_sent=re.sub(r'[0-9]+', '', clean_sent)

    return(clean_sent)
  • One of the comments before cleaning
In [83]:
corpus.iloc[5]
Out[83]:
'"\n\n Thomson Geer \n\nCan you please create this page for me. See all verifiable references and language\n\nThomson Geer is a independent Australian corporate law firm. It is the seventh largest independent firm in the country by number of lawyers.\n\nHistory\nThe firm was founded in 1887 in Adelaide, South Australia and has grown through various mergers to create a extensive Australian mainland presence.\n\nOn 31 March 2014 Thomsons Lawyers and Herbert Geer merged their legal practices to form Thomson Geer. \n\nOffices\nThomson Geer has offices in Sydney, Melbourne, Adelaide and Brisbane in Australia. \n\nFirm size\nThomson Geer has 80 Partners and another 250+ lawyers operating out of four offices. \n\nThomson Geer\'s clients are principally spread across 3 classifications: \n  \n•Australian Stock Exchange Top 200;  \n•Major Global Foreign Corporations; and  \n•Australian Stock Exchange Mid and Small Caps/Government Enterprises/Large and Medium Private Corporations.\n\nExternal links\nFirm website\n\nReferences\n"'
  • The same comment after cleaning
In [84]:
clean(corpus.iloc[5])
Out[84]:
' thomson geer please create page see verifiable reference language thomson geer independent australian corporate law firm seventh largest independent firm country number lawyers history firm found  adelaide south australia grow various mergers create extensive australian mainland presence  march  thomsons lawyers herbert geer merge legal practice form thomson geer offices thomson geer offices sydney melbourne adelaide brisbane australia firm size thomson geer  partner another  lawyers operate four offices thomson geer s clients principally spread across  classifications australian stock exchange top  major global foreign corporations australian stock exchange mid small cap government enterprises large medium private corporations external link firm website reference '
In [85]:
clean_corpus=corpus.apply(lambda x :clean(x))

TF(Term Frequency) - IDF(Inverse Document Frequency)

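As a quick illustration of what TF-IDF produces before applying it to the real comments (the three toy documents below are invented): words that appear in every document get a low IDF weight, while rare words get a high one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: "comment" appears everywhere, "toxic" in one document.
docs = [
    "clean comment about an article",
    "toxic comment with an insult",
    "another clean comment",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(X.shape)                  # (3 documents, 9 vocabulary terms)
print(sorted(vec.vocabulary_))  # the learned vocabulary
```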
In [86]:
def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=25):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs
In [87]:
tfv = TfidfVectorizer(min_df=200,  max_features=10000, 
            strip_accents='unicode', analyzer='word',ngram_range=(1,1),
            use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')
tfv.fit(clean_corpus)
features = np.array(tfv.get_feature_names())

df_unigrams =  tfv.transform(clean_corpus.iloc[:df.shape[0]])
In [88]:
#separate train and test features
df_feats=df.iloc[0:len(df),]
#join the tags
df_tags=df.iloc[:,2:]
df_feats=pd.concat([df_feats,df_tags],axis=1)
In [89]:
def top_tfidf_feats(row, features, top_n=50):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=50):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids, min_tfidf=0.1, top_n=50):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    
    D = Xtr[grp_ids].toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

# modified for multilabel multiclass
def top_feats_by_class(Xtr, features, min_tfidf=0.005, top_n=50):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    cols=df_tags.columns
    for col in cols:
        ids = df_tags.index[df_tags[col]==1]
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = col
        dfs.append(feats_df)
    return dfs
In [90]:
tfidf_top_n_per_class = top_feats_by_class(df_unigrams, features)
In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(corpus)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(corpus)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

Metrics Used in This Project

  • I used four different metrics:
    • Accuracy
    • Precision
    • Recall
    • AUC(Area Under the Curve)
  • You can find the definitions of accuracy, recall, and precision in the table below.
In [94]:
Image(filename = PATH + "metrics.png", width=700, height=700)
Out[94]:
  • AUC is calculated from sensitivity (true positive rate) and specificity (true negative rate); it is a very good metric for measuring how well we ordered the predicted probabilities of our classification.
  • You can find more about AUC at the link below.
  • Different metrics matter in different studies. In this study I try to find toxic comments, so missing a toxic comment is a problem. For this study recall is therefore my most important metric, and AUC is the second most important; the other metrics matter too, but less so.
  • I also compare the running time of the algorithms in this project.
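Since the metric definitions referenced above live in an image, here is a small sketch of how these metrics fall out of a confusion matrix, using invented labels and scores:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score, confusion_matrix)

# Invented labels and predicted probabilities, purely to illustrate the metrics.
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.4, 0.8, 0.9, 0.7]
y_pred  = [int(s > 0.5) for s in y_score]   # same 0.5 threshold as the notebook

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                       # 3 1 1 3
print(accuracy_score(y_true, y_pred))       # (tp + tn) / total = 0.75
print(precision_score(y_true, y_pred))      # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))         # tp / (tp + fn) = 0.75
print(roc_auc_score(y_true, y_score))       # ranking quality of the scores = 0.9375
```

Note that AUC is computed from the raw scores, not the thresholded predictions, which is why it measures ordering rather than a single decision rule.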

Logistic Regression

In [95]:
Image(filename = PATH + "logistic regression.png", width=700, height=600)
Out[95]:
  • There are many parameters in sklearn's LogisticRegression; after a few tries with cross-validation, I decided to use C=5 and solver='sag', leaving the other parameters at their defaults.
In [96]:
from sklearn.linear_model import LogisticRegression
class_names = ['any']
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
scores2 = []
scores3 = []
test_pred = pd.DataFrame.from_dict({'id': test['id']})
train_pred = pd.DataFrame.from_dict({'id': train['id']})

time1=time.time()

for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(C=5, solver='sag')

    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='accuracy')
    scores.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='precision')
    scores2.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='recall')
    scores3.append(cv_score)

time2=time.time()

print('Average CV accuracy is {}'.format(round(np.mean(scores),5)))
print('Standard Deviation of CV accuracy is {}'.format(round(np.std(scores),5)))
print('Average CV precision is {}'.format(round(np.mean(scores2),5)))
print('Standard Deviation of CV precision is {}'.format(round(np.std(scores2),5)))
print('Average CV recall is {}'.format(round(np.mean(scores3),5)))
print('Standard Deviation of CV recall is {}'.format(round(np.std(scores3),5)))
print("Time of cross validation",round(time2-time1,5))
Average CV accuracy is 0.95219
Standard Deviation of CV accuracy is 0.00197
Average CV precision is 0.92719
Standard Deviation of CV precision is 0.01117
Average CV recall is 0.58055
Standard Deviation of CV recall is 0.01597
Time of cross validation 94.55538
In [98]:
classifier = LogisticRegression(C=5, solver='sag')

time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
Training time =  6.8265
Test time =  0.58436
In [99]:
test["pred"]=test_pred["any"]>0.5
train["pred"]=train_pred["any"]>0.5
print("Accuracy score of train set : ",round(accuracy_score(train["any"], train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(train["any"], train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(train["any"], train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Accuracy score of train set :  0.99488
Accuracy score of test set :  0.96125
Precision score of train set :  0.99809
Precision score of test set :  0.9375
Recall score of train set :  0.95198
Recall score of test set :  0.64885
In [100]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
  • Confusion Matrix of Train set
In [141]:
confusion_matrix_train = confusion_matrix(train["any"], train["pred"])
print(confusion_matrix_train)
[[14352     3]
 [   79  1566]]
  • Confusion Matrix of Test set
In [142]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3590   17]
 [ 138  255]]

Naive Bayes Classifier

In [143]:
class_names = ['any']

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
scores2 = []
scores3 = []
test_pred = pd.DataFrame.from_dict({'id': test['id']})
train_pred = pd.DataFrame.from_dict({'id': train['id']})

time1=time.time()

for class_name in class_names:
    train_target = train[class_name]
    classifier = MultinomialNB()

    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='accuracy')
    scores.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='precision')
    scores2.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='recall')
    scores3.append(cv_score)

time2=time.time()

print('Average CV accuracy is {}'.format(round(np.mean(scores),5)))
print('Standard Deviation of CV accuracy is {}'.format(round(np.std(scores),5)))
print('Average CV precision is {}'.format(round(np.mean(scores2),5)))
print('Standard Deviation of CV precision is {}'.format(round(np.std(scores2),5)))
print('Average CV recall is {}'.format(round(np.mean(scores3),5)))
print('Standard Deviation of CV recall is {}'.format(round(np.std(scores3),5)))
print("Time of cross validation",round(time2-time1,5))
Average CV accuracy is 0.93519
Standard Deviation of CV accuracy is 0.00116
Average CV precision is 0.94262
Standard Deviation of CV precision is 0.01862
Average CV recall is 0.39392
Standard Deviation of CV recall is 0.01472
Time of cross validation 6.79276
In [207]:
classifier = MultinomialNB()

time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
Training time =  0.4698
Test time =  0.08203
In [208]:
test["pred"]=test_pred["any"]>0.5
train["pred"]=train_pred["any"]>0.5
print("Accuracy score of train set : ",round(accuracy_score(train["any"], train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(train["any"], train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(train["any"], train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Accuracy score of train set :  0.94656
Accuracy score of test set :  0.94425
Precision score of train set :  0.94482
Precision score of test set :  0.925
Recall score of train set :  0.51003
Recall score of test set :  0.47074
In [209]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
  • Confusion Matrix of Train set
In [210]:
confusion_matrix_train = confusion_matrix(train["any"], train["pred"])
print(confusion_matrix_train)
[[14306    49]
 [  806   839]]
  • Confusion Matrix of Test set
In [211]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3592   15]
 [ 208  185]]

AdaBoost Classifier

  • AdaBoost is an ensemble machine learning method. It is generally used for its high predictive power, but it runs slowly compared to Naive Bayes and logistic regression.
In [132]:
class_names = ['any']

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
scores2 = []
scores3 = []
test_pred = pd.DataFrame.from_dict({'id': test['id']})
train_pred = pd.DataFrame.from_dict({'id': train['id']})

time1=time.time()

for class_name in class_names:
    train_target = train[class_name]
    classifier = AdaBoostClassifier()

    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='accuracy')
    scores.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='precision')
    scores2.append(cv_score)
    cv_score = cross_val_score(classifier, train_features, train_target, cv=5, scoring='recall')
    scores3.append(cv_score)

time2=time.time()

print('Average CV accuracy is {}'.format(round(np.mean(scores),5)))
print('Standard Deviation of CV accuracy is {}'.format(round(np.std(scores),5)))
print('Average CV precision is {}'.format(round(np.mean(scores2),5)))
print('Standard Deviation of CV precision is {}'.format(round(np.std(scores2),5)))
print('Average CV recall is {}'.format(round(np.mean(scores3),5)))
print('Standard Deviation of CV recall is {}'.format(round(np.std(scores3),5)))
print("Time of cross validation",round(time2-time1,5))
Average CV accuracy is 0.94525
Standard Deviation of CV accuracy is 0.00263
Average CV precision is 0.84341
Standard Deviation of CV precision is 0.02683
Average CV recall is 0.57508
Standard Deviation of CV recall is 0.02286
Time of cross validation 1107.30808
In [212]:
classifier = AdaBoostClassifier()

time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
Training time =  95.40115
Test time =  0.78924
In [213]:
test["pred"]=test_pred["any"]>0.5
train["pred"]=train_pred["any"]>0.5
print("Accuracy score of train set : ",round(accuracy_score(train["any"], train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(train["any"], train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(train["any"], train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Accuracy score of train set :  0.95206
Accuracy score of test set :  0.9505
Precision score of train set :  0.87457
Precision score of test set :  0.81759
Recall score of train set :  0.6231
Recall score of test set :  0.63868
In [214]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
  • Confusion Matrix of Train set
In [215]:
confusion_matrix_train = confusion_matrix(train["any"], train["pred"])
print(confusion_matrix_train)
[[14208   147]
 [  620  1025]]
  • Confusion Matrix of Test set
In [216]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3551   56]
 [ 142  251]]

Cross Validations and Trains&Tests Results

In [62]:
Image(filename = PATH + "metricsresults1.png", width=800, height=600)
Out[62]:
  • On running time, Naive Bayes is the fastest. AdaBoost's training took 95.4 s versus 0.47 s for Naive Bayes (about 200x slower) and 6.8 s for logistic regression (about 14x slower). On prediction time, Naive Bayes (0.08 s) is also faster than logistic regression (0.58 s).
  • Based on the CV and test results, the best algorithm on the recall metric (by a small margin over AdaBoost) and on the AUC metric is logistic regression. Still, roughly 65% test recall is quite low.
  • The accuracy numbers look good for all three, but the class split is roughly 90%-10%, so predicting 0 for every comment would already give about 90% accuracy.
  • Using the logistic regression results, I look at some of the False Negative (FN) comments, because reducing the FN count is what will increase recall.
In [101]:
test_recall = test.loc[(test['any'] == 1) & (test['pred']==False)]
test_recall.loc[:, ["comment_text", "any", "pred"]]
test_recall = test_recall.reset_index(drop=True)

train_recall = train.loc[(train['any'] == 1) & (train['pred']==False)]
In [102]:
test_recall["comment_text"][7]
Out[102]:
'"\n Well go verify it, ""Dick and Jane get sexually mutilated"" page 18.  "'
In [103]:
test_recall["comment_text"][3]
Out[103]:
"Women bus drivers \n\nThoughts? I don't like them.  What about you? Do you think it's right they should drive buses; or do you think they should stick to washing up? 164.39.151.107"
In [104]:
test_recall["comment_text"][12]
Out[104]:
'":A certain admin will block you if you tell the truth about a certain user\'s dirty tricks. But she won\'t do anything to the user who pulled the dirty tricks. She only ""punishes"" the victim of the dirty tricks. The perpetrator goes scot free. That\'s ""Wikipedia justice"", as practiced by admin pschemp.  \n\nApparently she wants no one to know that Ryulong deleted my request for help on Khoikhoi\'s talk page, falsely describing my request for help as ""vandalism"" in his edit summary. Oops! Was it ""uncivil"" again, for me to tell the truth? If so, then prosecutors in court will never be able to tell the truth about crimes, because it would be an ""uncivil"" ""personal attack"" to use words such as ""lie"", ""dishonest"", ""vandal"", ""thief"", ""murderer"", ""rapist"", etc. The judge would get too upset by such ""uncivil comments"". Come to think of it, every time an admin describes a user\'s action as ""vandalism"", or a user as a ""vandal"", that\'s an improper ""personal attack"", according to pschemp\'s logic. If pschemp treats people equally (not likely), then she better block all the admins who make those ""personal attacks"" of describing a particular user\'s behavior as ""vandalism"". And she better block Ryulong for using that ""personal attack"" against me in his false edit summaries.  \n\n"'
  • After looking at these comments, I suspected my train dataset did not contain enough tagged (any = 1) comments, so I decided to add two new steps to my work.
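A quick way to check this suspicion is to measure the tag ratio directly. A minimal sketch on a hypothetical miniature frame (the real frame has the same six label columns, with the `any` flag derived the same way):

```python
import pandas as pd

# Hypothetical miniature of the training frame: one row per comment,
# with the same six label columns as the real data
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
mini = pd.DataFrame([[0, 0, 0, 0, 0, 0],
                     [1, 0, 1, 0, 1, 0],
                     [0, 0, 0, 0, 0, 0],
                     [0, 0, 0, 0, 0, 0],
                     [0, 0, 0, 1, 0, 0]], columns=label_cols)

# A comment counts as tagged if any label fires
mini['any'] = mini[label_cols].max(axis=1)
print(mini['any'].sum(), "tagged rows out of", len(mini))
# → 2 tagged rows out of 5
```

On the real train set the same computation gives roughly a 10%-90% split, which motivates the two steps below.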

New Two Steps for Increasing Recall Metrics

In [106]:
Image(filename = PATH + "steps2.png", width=800, height=600)
Out[106]:
  • In the third step I duplicate the tagged data in my train dataset ten times and append the copies to the train dataset.
  • In the fourth step I append a completely new dataset to my train dataset.
  • I did not change anything in my test dataset: it is identical in steps 2, 3 and 4, so the test results stay comparable.

Step 3 : Added 10 Times Same Tagged Data on My Train Dataset

  • Finding the tagged data in my train dataset.
In [220]:
train_any_1 = train.loc[(train['any'] == 1)]
train_any_1.loc[:, ["comment_text", "any", "pred"]]
train_any_1 = train_any_1.reset_index(drop=True)
  • Appending it ten times.
In [222]:
new_train = pd.concat([train] + [train_any_1] * 10)
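The effect of this ten-fold duplication on the class balance can be sketched on a toy stand-in frame (hypothetical data; only the `any` column mirrors the real setup):

```python
import pandas as pd

# Toy stand-in for the train frame: 8 untagged rows and 2 tagged rows
toy = pd.DataFrame({'any': [0] * 8 + [1] * 2, 'x': range(10)})
minority = toy[toy['any'] == 1]

# Append ten extra copies of the tagged rows, as in the step above
oversampled = pd.concat([toy] + [minority] * 10, ignore_index=True)
print(len(oversampled), round(oversampled['any'].mean(), 3))
# → 30 0.733
```

The tagged share jumps from 0.2 to about 0.73, which is exactly the shift applied to the real train set here (16,000 → 32,450 rows).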
In [223]:
train_rows = len(new_train)
test_rows = len(test)
train_2 = new_train.iloc[:, 0:2]
test_2 = test.iloc[:, 0:2]
df = pd.concat([train_2, test_2])
df_2 = df.iloc[:, 0:2].reset_index(drop=True)
print("- I have ",train_rows, " rows on my new regenerated train set ")
print("- I have ",test_rows, " rows on my test set ")
- I have  32450  rows on my new regenerated train set 
- I have  4000  rows on my test set 
In [224]:
corpus=df_2.comment_text
train_text = train_2.comment_text
test_text = test_2.comment_text
  • I clean my corpus again with the functions I wrote above.
In [225]:
clean_corpus=corpus.apply(lambda x :clean(x))
In [226]:
tfv = TfidfVectorizer(min_df=200,  max_features=10000, 
            strip_accents='unicode', analyzer='word',ngram_range=(1,1),
            use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')
tfv.fit(clean_corpus)
features = np.array(tfv.get_feature_names())

df_unigrams =  tfv.transform(clean_corpus.iloc[:df.shape[0]])
In [227]:
# separate train and test features
df_feats = df.iloc[0:len(df), ]
# join the tags
df_tags = df.iloc[:, 2:]
df_feats = pd.concat([df_feats, df_tags], axis=1)
In [228]:
tfidf_top_n_per_class = top_feats_by_class(df_unigrams, features)
In [229]:
from sklearn.feature_extraction.text import TfidfVectorizer
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(corpus)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(corpus)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
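Before the modelling section, it may help to see how the word and char features combine. A minimal sketch on a made-up three-comment corpus (the `hstack` call is the same one used in the next cell):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Three hypothetical comments standing in for the corpus
docs = ["you are great", "you are awful", "totally fine comment"]

word_v = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
char_v = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
word_X = word_v.fit_transform(docs)
char_X = char_v.fit_transform(docs)

# hstack keeps everything sparse and leaves the rows aligned per document,
# so each row holds that comment's char n-gram and word features side by side
X = hstack([char_X, word_X])
print(X.shape[0], X.shape[1] == char_X.shape[1] + word_X.shape[1])
# → 3 True
```

Keeping the stacked matrix sparse matters here: with 10,000 word features and 50,000 char features, a dense representation would not fit comfortably in memory.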

Logistic Regression with Regenerated Data

In [230]:
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
train_target = new_train[class_name]

test_pred = pd.DataFrame.from_dict({'id': test['id']})
train_pred = pd.DataFrame.from_dict({'id': new_train['id']})


classifier = LogisticRegression(C=5, solver='sag')
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

test["pred"]=test_pred["any"]>0.5
new_train["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train["any"], new_train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train["any"], new_train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train["any"], new_train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  14.85858
Test time =  0.59225
Accuracy score of train set :  0.99895
Accuracy score of test set :  0.9555
Precision score of train set :  0.99812
Precision score of test set :  0.79452
Recall score of train set :  1.0
Recall score of test set :  0.73791
In [232]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
In [233]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3532   75]
 [ 103  290]]

Naive Bayes Classifier with Regenerated Data

In [234]:
classifier = MultinomialNB()
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

test["pred"]=test_pred["any"]>0.5
new_train["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train["any"], new_train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train["any"], new_train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train["any"], new_train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  0.85328
Test time =  0.0631
Accuracy score of train set :  0.92619
Accuracy score of test set :  0.89025
Precision score of train set :  0.92765
Precision score of test set :  0.46714
Recall score of train set :  0.94103
Recall score of test set :  0.83206
In [236]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
In [237]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3234  373]
 [  66  327]]

AdaBoost Classifier with Regenerated Data

In [238]:
classifier = AdaBoostClassifier()
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

test["pred"]=test_pred["any"]>0.5
new_train["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train["any"], new_train["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train["any"], new_train["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train["any"], new_train["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  142.928
Test time =  0.74408
Accuracy score of train set :  0.86703
Accuracy score of test set :  0.86275
Precision score of train set :  0.89611
Precision score of test set :  0.40347
Recall score of train set :  0.8614
Recall score of test set :  0.82952
In [239]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
In [240]:
confusion_matrix_test = confusion_matrix(test["any"], test["pred"])
print(confusion_matrix_test)
[[3125  482]
 [  67  326]]

Step 2 and Step 3 Results

In [112]:
Image(filename = PATH + "results2.png", width=800, height=600)
Out[112]:
  • After appending the duplicated data to our train dataset, the accuracy, precision and AUC scores (except for Naive Bayes) decrease, but the recall scores increase, especially for Naive Bayes.
  • Based on the recall score on the test set, Naive Bayes is now the best algorithm, whereas logistic regression was best in step 2.

Step 4 : Added New Tagged Data on My Train Dataset

  • I add the new data and apply the same cleaning and NLP process to it.
In [108]:
len(df_others)
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df_others['none'] = 1-df_others[label_cols].max(axis=1)
df_others['any'] = df_others[label_cols].max(axis=1)
In [242]:
df_others_any_1 = df_others.loc[(df_others['any'] == 1)]
df_others_any_1.loc[:, ["comment_text", "any"]]
df_others_any_1 = df_others_any_1.reset_index(drop=True)
In [243]:
len(df_others_any_1)
Out[243]:
14187
In [244]:
new_train_2 = pd.concat([train_others, df_others_any_1])
len(new_train_2)
Out[244]:
30187
In [245]:
train_rows = len(new_train_2)
test_rows = len(test)
train_2 = new_train_2.iloc[:, 0:2]
test_2 = test.iloc[:, 0:2]
df = pd.concat([train_2, test_2])
df_2 = df.iloc[:, 0:2].reset_index(drop=True)
print("- I have ",train_rows, " rows on my new regenerated train set ")
print("- I have ",test_rows, " rows on my test set ")
- I have  30187  rows on my new regenerated train set 
- I have  4000  rows on my test set 
In [246]:
corpus=df_2.comment_text
train_text = train_2.comment_text
test_text = test_2.comment_text

clean_corpus=corpus.apply(lambda x :clean(x))

tfv = TfidfVectorizer(min_df=200,  max_features=10000, 
            strip_accents='unicode', analyzer='word',ngram_range=(1,1),
            use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')
tfv.fit(clean_corpus)
features = np.array(tfv.get_feature_names())

df_unigrams =  tfv.transform(clean_corpus.iloc[:df.shape[0]])

# separate train and test features
df_feats = df.iloc[0:len(df), ]
# join the tags
df_tags = df.iloc[:, 2:]
df_feats = pd.concat([df_feats, df_tags], axis=1)

tfidf_top_n_per_class = top_feats_by_class(df_unigrams, features)

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(corpus)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(corpus)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

Logistic Regression with Addition of New Any = 1 Dataset

In [247]:
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
train_target = new_train_2[class_name]

test_pred = pd.DataFrame.from_dict({'id': test['id']})
train_pred = pd.DataFrame.from_dict({'id': new_train_2['id']})


classifier = LogisticRegression(C=5, solver='sag')
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()


test["pred"]=test_pred["any"]>0.5
new_train_2["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train_2["any"], new_train_2["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train_2["any"], new_train_2["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train_2["any"], new_train_2["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  13.00684
Test time =  0.54038
Accuracy score of train set :  0.98821
Accuracy score of test set :  0.9255
Precision score of train set :  0.98789
Precision score of test set :  0.57724
Recall score of train set :  0.98964
Recall score of test set :  0.90331
In [248]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train_2["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Naive Bayes Classifier with Addition of New Any = 1 Dataset

In [249]:
classifier = MultinomialNB()
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

test["pred"]=test_pred["any"]>0.5
new_train_2["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train_2["any"], new_train_2["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train_2["any"], new_train_2["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train_2["any"], new_train_2["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  0.77612
Test time =  0.0625
Accuracy score of train set :  0.88873
Accuracy score of test set :  0.866
Precision score of train set :  0.89335
Precision score of test set :  0.41417
Recall score of train set :  0.89464
Recall score of test set :  0.87786
In [250]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train_2["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

AdaBoost Classifier with Addition of New Any = 1 Dataset

In [251]:
classifier = AdaBoostClassifier()
time1=time.time()
classifier.fit(train_features, train_target)
time2=time.time()
train_pred[class_name] = classifier.predict_proba(train_features)[:, 1]   
time3=time.time()
test_pred[class_name] = classifier.predict_proba(test_features)[:, 1]
time4=time.time()

test["pred"]=test_pred["any"]>0.5
new_train_2["pred"]=train_pred["any"]>0.5
print("Training time = ",round(time2-time1,5))
print("Test time = ",round(time4-time3,5))
print("Accuracy score of train set : ",round(accuracy_score(new_train_2["any"], new_train_2["pred"]),5))
print("Accuracy score of test set : ",round(accuracy_score(test["any"], test["pred"]),5))
print("Precision score of train set : ",round(precision_score(new_train_2["any"], new_train_2["pred"]),5))
print("Precision score of test set : ",round(precision_score(test["any"], test["pred"]),5))
print("Recall score of train set : ",round(recall_score(new_train_2["any"], new_train_2["pred"]),5))
print("Recall score of test set : ",round(recall_score(test["any"], test["pred"]),5))
Training time =  152.14412
Test time =  0.67949
Accuracy score of train set :  0.86878
Accuracy score of test set :  0.889
Precision score of train set :  0.90402
Precision score of test set :  0.46403
Recall score of train set :  0.83887
Recall score of test set :  0.83715
In [252]:
# calculate the fpr and tpr for all thresholds of the classification
from sklearn import metrics
preds = test["pred"]
fpr, tpr, threshold = metrics.roc_curve(test["any"], test_pred["any"])
fpr_t, tpr_t, threshold_t = metrics.roc_curve(new_train_2["any"], train_pred["any"])
roc_auc = metrics.auc(fpr, tpr)
roc_auc_t = metrics.auc(fpr_t, tpr_t)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'TEST_AUC = %0.5f' % roc_auc)
plt.plot(fpr_t, tpr_t, 'r', label = 'TRAIN_AUC = %0.5f' % roc_auc_t)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Step 2, 3 and 4 Results

In [114]:
Image(filename = PATH + "results3.png", width=800, height=600)
Out[114]:
  • Based on the step-4 test results, the best algorithm is logistic regression on every metric.
  • So we can say logistic regression does not learn repeated data very effectively, but it learns genuinely new data very well.
  • When we added the new data, our precision and accuracy scores decreased,
  • but our recall, AUC and Gini coefficients (2*AUC-1) increased.
  • Comparing steps 2 and 3, we can say that manipulating the train dataset can noticeably change the metrics on the test dataset, and the size of the effect differs per algorithm (for example, logistic regression benefits less from repeated data than the other two).
  • Comparing steps 2 and 4, we can say that adding new data to the train dataset can change the test metrics for better or for worse.
  • The accuracy drop on the step 3 and 4 test sets is also partly caused by the very different class distributions of their train and test datasets.
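The Gini coefficient mentioned above is just a rescaled AUC. A minimal sketch on a tiny hypothetical example:

```python
from sklearn.metrics import roc_auc_score

# Tiny hypothetical example: true labels and predicted probabilities
y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.7, 0.35, 0.8, 0.6]

auc = roc_auc_score(y_true, y_prob)
gini = 2 * auc - 1   # the Gini coefficient referred to in the bullets above
print(round(auc, 4), round(gini, 4))
# → 0.8333 0.6667
```

Because the mapping is linear, ranking models by Gini always gives the same order as ranking them by AUC; Gini just rescales the range so that random guessing scores 0 instead of 0.5.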

Reference